Abstract

We describe an implemented system for robust 
domain-independent syntactic parsing of English, 
using a unification-based grammar of part-of- 
speech and punctuation labels coupled with a 
probabilistic LR parser. We present evaluations 
of the system's performance along several differ- 
ent dimensions; these enable us to assess the con- 
tribution that each individual part is making to 
the success of the system as a whole, and thus 
prioritise the effort to be devoted to its further 
enhancement. Currently, the system is able to 
parse around 80% of sentences in a substantial 
corpus of general text containing a number of 
distinct genres. On a random sample of 250 
such sentences the system has a mean crossing 
bracket rate of 0.71 and recall and precision of 
83% and 84% respectively when evaluated against 
manually-disambiguated analyses¹.

1. INTRODUCTION 

This work is part of an effort to develop a robust,
domain-independent syntactic parser capable of
yielding the unique correct analysis for unrestricted
naturally-occurring input. Our goal is to develop a
system with performance comparable to extant
part-of-speech taggers, returning a syntactic analysis
from which predicate-argument structure can be
recovered, and which can support semantic
interpretation. The requirement for a
domain-independent analyser favours statistical
techniques to resolve ambiguities, whilst the latter
goal favours a more sophisticated grammatical
formalism than is typical in statistical approaches
to robust analysis of corpus material.

1 Some of this work was carried out while the
second author was visiting Rank Xerox, Grenoble.
The work was also supported by UK DTI/SALT
project 41/5808 'Integrated Language Database', and
by SERC/EPSRC Advanced Fellowships to both authors.
Geoff Nunberg provided encouragement and
much advice on the analysis of punctuation, and Greg
Grefenstette undertook the original corpus tokenisation
and segmentation for the punctuation experiments.
Bernie Jones and Kiku Ribas made helpful
comments on an earlier draft. We are responsible for
any mistakes.

Briscoe & Carroll (1993) describe a probabilistic
parser using a wide-coverage unification-based
grammar of English written in the Alvey
Natural Language Tools (ANLT) metagrammatical
formalism (Briscoe et al., 1987), generating
around 800 rules in a syntactic variant of the Def- 
inite Clause Grammar formalism (DCG, Pereira 
& Warren, 1980) extended with iterative (Kleene) 
operators. The ANLT grammar is linked to a lex- 
icon containing about 64K entries for 40K lex- 
emes, including detailed subcategorisation infor- 
mation appropriate for the grammar, built semi- 
automatically from a learners' dictionary (Car- 
roll & Grover, 1989). The resulting parser is 
efficient, constructing a parse forest in roughly 
quadratic time (empirically), and efficiently re- 
turning the ranked n-most likely analyses (Car- 
roll, 1993, 1994). The probabilistic model is a 
refinement of probabilistic context-free grammar 
(PCFG) conditioning CF 'backbone' rule applica- 
tion on LR state and lookahead item. Unification 
of the 'residue' of features not incorporated into
the backbone is performed at parse time in conjunction
with reduce operations. Unification failure
results in the associated derivation being assigned
a probability of zero. Probabilities are assigned
to transitions in the LALR(1) action table
via a process of supervised training based on computing
the frequency with which transitions are
traversed in a corpus of parse histories. The result
is a probabilistic parser which, unlike a PCFG, is
capable of probabilistically discriminating derivations
which differ only in terms of order of application
of the same set of CF backbone rules, due
to the parse context defined by the LR table.
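
The training step described above, counting how often each action-table transition is traversed in a corpus of parse histories, can be sketched as follows. This is an illustration only: the encoding of a parse history as a list of (state, lookahead, action) triples, the function names, and the normalisation of counts over the actions competing at one state and lookahead are assumptions made for the example rather than details given in this paper. Smoothing is omitted.

```python
from collections import Counter, defaultdict
from math import prod


def train_transition_probs(parse_histories):
    """Estimate P(action | LR state, lookahead) by relative frequency.

    Each history is a list of (state, lookahead, action) triples; this
    encoding is a hypothetical one chosen for the sketch.
    """
    counts = Counter()
    context_totals = defaultdict(int)
    for history in parse_histories:
        for state, lookahead, action in history:
            counts[(state, lookahead, action)] += 1
            context_totals[(state, lookahead)] += 1
    return {key: n / context_totals[key[:2]] for key, n in counts.items()}


def derivation_score(history, probs):
    """Score a derivation as the product of its transition probabilities.

    A transition never seen in training contributes zero in this sketch.
    """
    return prod(probs.get(step, 0.0) for step in history)


# Toy parse histories (invented for illustration).
histories = [
    [(0, "AT", "shift 3"), (3, "NN1", "reduce NP")],
    [(0, "AT", "shift 3"), (3, "NN1", "shift 7")],
    [(0, "AT", "shift 3"), (3, "NN1", "reduce NP")],
]
probs = train_transition_probs(histories)
```

Because each probability is attached to a transition taken in a particular LR state, two derivations that apply the same backbone rules in a different order pass through different states and can receive different scores, which a PCFG cannot express.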

Experiments with this system revealed three
major problems which our current research is addressing.
Firstly, improvements in probabilistic
parse selection will require a 'lexicalised' gram- 
mar/parser in which (minimally) probabilities are 
associated with alternative subcategorisation pos- 
sibilities of individual lexical items. Currently, the 
relative frequency of subcategorisation possibili- 
ties for individual lexical items is not recorded in 
wide-coverage lexicons, such as ANLT or COMLEX
(Grishman et al., 1994). Secondly, removal
of punctuation from the input (after segmen- 
tation into text sentences) worsens performance 
as punctuation both reduces syntactic ambigu- 
ity (Jones, 1994) and signals non-syntactic (dis- 
course) relations between text units (Nunberg, 
1990). Thirdly, the largest source of error on un- 
seen input is the omission of appropriate subcate- 
gorisation values for lexical items (mostly verbs), 
preventing the system from finding the correct 
analysis. The current coverage of this system on a
general corpus (e.g. Brown or LOB), that is, the
proportion of sentences for which at least one
analysis was found², is estimated to be around 20%
by Briscoe (1994). Therefore, we have developed
a variant probabilistic LR parser which does not 
rely on subcategorisation and uses punctuation to 
reduce ambiguity. The analyses produced by this 
parser can be utilised for phrase-finding applica- 
tions, recovery of subcategorisation frames, and 
other 'intermediate' level parsing problems. 

2. PART-OF-SPEECH TAG 
SEQUENCE GRAMMAR 

We utilised the ANLT metagrammatical formal- 
ism to develop a feature-based, declarative de- 
scription of part-of-speech (PoS) label sequences 
(see e.g. Church, 1988) for English. This gram- 
mar compiles into a DCG-like grammar of ap- 
proximately 400 rules. It has been designed 
to enumerate possible valencies for predicates 
(verbs, adjectives and nouns) by including separate
rules for each pattern of possible complementation
in English. The distinction between arguments
and adjuncts is expressed, following X-bar
theory (e.g. Jackendoff, 1977), by Chomsky-adjunction
of adjuncts to maximal projections
(XP → XP Adjunct) as opposed to government of
arguments (i.e. arguments are sisters within X1
projections; X1 → X0 Arg1 ... ArgN). Although
the grammar enumerates complementation possibilities
and checks for global sentential well-formedness,
it is best described as 'intermediate'
as it does not attempt to associate 'displaced' constituents
with their canonical position / grammatical
role.

2 Briscoe & Carroll (1995) note that "coverage" is
a weak measure since discovery of one or more global
analyses does not entail that the correct analysis is
recovered.

The other difference between this grammar 
and a more conventional one is that it incorporates 
some rules specifically designed to overcome lim- 
itations or idiosyncrasies of the tagging process. 
For example, past participles functioning adjectivally,
as in (1a), are frequently tagged as past
participles (VVN) as in (1b), so the grammar incorporates
a rule (violating X-bar theory) which
parses past participles as adjectival premodifiers 
in this context. 

(1) a The disembodied head

    b The_AT disembodied_VVN head_NN1

Similar idiosyncratic rules are incorporated for 
dealing with gerunds, adjective-noun conversions, 
idiom sequences, and so forth. Further details of 
the PoS grammar are given in Briscoe & Carroll 
(1994, 1995). 

The grammar currently covers around 80% of 
the Susanne corpus (Sampson, 1995), a 138K word 
treebanked and balanced subset of the Brown cor- 
pus. Many of the 'failures' are due to the root 
S(entence) requirement enforced by the parser 
when dealing with fragments from dialogue and 
so forth. We have not relaxed this requirement 
since it increases ambiguity, our primary interest
at this point being the extraction of subcategorisation
information from full clauses in corpus data.



3. TEXT GRAMMAR AND 
PUNCTUATION 

Nunberg (1990) develops a partial 'text' grammar 
for English which incorporates many constraints 
that (ultimately) restrict syntactic and seman- 
tic interpretation. For example, textual adjunct 
clauses introduced by colons scope over following 
punctuation, as (2a) illustrates; whilst textual ad- 
juncts introduced by dashes cannot intervene be- 
tween a bracketed adjunct and the textual unit to 
which it attaches, as in (2b). 



(2) a *He told them his reason: he would not
      renegotiate his contract, but he did not
      explain to the team owners, (vs. but
      would stay)
    b *She left - who could blame her - (during
      the chainsaw scene) and went home.

We have developed a declarative grammar in 
the ANLT metagrammatical formalism, based on 
Nunberg's procedural description. This grammar 
captures the bulk of the text-sentential constraints
described by Nunberg and compiles into 26
DCG-like rules. Text grammar analyses
are useful because they demarcate some of
the syntactic boundaries in the text sentence and 
thus reduce ambiguity, and because they identify 
the units for which a syntactic analysis should, in 
principle, be found; for example, in (3), the ab- 
sence of dashes would mislead a parser into seek- 
ing a syntactic relationship between three and the 
following names, whilst in fact there is only a dis- 
course relation of elaboration between this text 
adjunct and pronominal three. 

(3) The three - Miles J. Cooperman, Sheldon 
Teller, and Richard Austin - and eight 
other defendants were charged in six in- 
dictments with conspiracy to violate fed- 
eral narcotic law. 

Further details of the text grammar are given 
in Briscoe & Carroll (1994, 1995). The text 
grammar has been tested on the Susanne corpus 
and covers 99.8% of sentences. (The failures are 
mostly text segmentation problems). The number 
of analyses varies from one (71%) to the thousands 
(0.1%). Just over 50% of Susanne sentences con- 
tain some punctuation, so around 20% of the sin- 
gleton parses are punctuated. The major source of 
ambiguity in the analysis of punctuation concerns 
the function of commas and their relative scope as 
a result of a decision to distinguish delimiters and 
separators (Nunberg 1990:36). Therefore, a text 
sentence containing eight commas (and no other 
punctuation) will have 3170 analyses. The mul- 
tiple uses of commas cannot be resolved without 
access to (at least) the syntactic context of occur- 
rence. 

4. THE INTEGRATED 
GRAMMAR 

Despite Nunberg's observation that text grammar 
is distinct from syntax, text grammatical ambigu- 
ity favours interleaved application of text gram- 



matical and syntactic constraints. Integrating the 
text and the PoS sequence grammars is straight- 
forward and the result remains modular, in that 
the text grammar is 'folded into' the PoS sequence 
grammar, by treating text and syntactic categories 
as overlapping and dealing with the properties of 
each using disjoint sets of features, principles of 
feature propagation, and so forth. In addition to 
the core text-grammatical rules which carry over 
unchanged from the stand-alone text grammar, 44 
syntactic rules (of pre- and post-posing, and coordination)
now include (often optional) comma
markers corresponding to the purely 'syntactic' 
uses of punctuation. 

The approach to text grammar taken here is in 
many ways similar to that of Jones (1994). How- 
ever, he opts to treat punctuation marks as clitics 
on words which introduce additional featural information
into standard syntactic rules. Thus, his
grammar is thoroughly integrated and it would be 
harder to extract an independent text grammar 
or build a modular semantics. Our less-tightly in- 
tegrated grammar is described in more detail in 
Briscoe & Carroll (1994). 

5. PARSING THE SUSANNE AND 
SEC CORPORA 

We have used the integrated grammar to parse 
the Susanne corpus and the quite distinct Spoken 
English Corpus (SEC; Taylor & Knowles, 1988), a 
50K word treebanked corpus of transcribed British 
radio programmes punctuated by the corpus com- 
pilers. Both corpora were retagged using the Ac- 
quilex HMM tagger (Elworthy, 1993, 1994) trained 
on text tagged with a slightly modified version of 
CLAWS-II labels (Garside et al., 1987). In contrast
to previous systems taking as input fully-determinate
sequences of PoS labels, such as Fidditch
(Hindle, 1989) and MITFP (de Marcken,
1990), for each word the tagger returns multiple 
label hypotheses, and each is thresholded before 
being passed on to the parser: a given label is re- 
tained if it is the highest-ranked, or, if the highest- 
ranked label is assigned a likelihood of less than 
0.9, if its likelihood is within a factor of 50 of this. 
We thus attempt to minimise the effect of incor- 
rect tagging on the parsing component by allow- 
ing label ambiguities, but control the increase in 
indeterminacy and concomitant decrease in subsequent
processing efficiency by applying the thresholding
technique. On Susanne, retagging allowing
only a single label per word results in a 97.90%
label/word assignment accuracy, whereas multi-label
tagging with this thresholding scheme results
in 99.51% accuracy.
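
The thresholding rule for tagger output can be written down compactly. The sketch below is a minimal reading of the rule as stated in the text; the function name, parameter names and the dictionary representation of the tagger's label hypotheses are assumptions made for the example, and the likelihood values are invented.

```python
def threshold_labels(hypotheses, top_cutoff=0.9, factor=50.0):
    """Select the PoS labels for one word that are passed to the parser.

    hypotheses maps each candidate label to its likelihood. The
    highest-ranked label is always kept. If its likelihood is below
    top_cutoff, any other label whose likelihood is within `factor`
    of the best one is kept as well.
    """
    ranked = sorted(hypotheses.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best_p = ranked[0]
    kept = [best_label]
    if best_p < top_cutoff:
        kept += [label for label, p in ranked[1:] if p * factor >= best_p]
    return kept


# Invented likelihoods for illustration.
confident = threshold_labels({"VVN": 0.95, "JJ": 0.05})
uncertain = threshold_labels({"VVN": 0.60, "JJ": 0.39, "NN1": 0.01})
```

In the first call only the top label survives because it reaches the 0.9 cutoff; in the second, the runner-up is retained but the third label falls outside the factor of 50.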

In an earlier paper (Briscoe & Carroll, 1995) 
we gave results for a previous version of the gram- 
mar and parsing system. We have made a num- 
ber of significant improvements to the system since 
then, the most fundamental being the use of multi- 
ple labels for each word. System accuracy evalua- 
tion results are also improved since we now output 
trees that conform more closely to the annotation 
conventions employed in the test treebank. 

COVERAGE AND AMBIGUITY 

To examine the efficiency and coverage of the 
grammar we applied it to our retagged versions of 
Susanne and SEC. We used the ANLT chart parser 
(Carroll, 1993), but modified just to count the 
number of possible parses in the parse forests (Bil- 
lot & Lang, 1989) rather than actually unpacking 
them. We also imposed a per-sentence time-out 
of 30 seconds CPU time, running in Franz Alle- 
gro Common Lisp 4.2 on an HP PA-RISC 715/100 
workstation with 128 Mbytes of physical memory. 

For both corpora, the majority of sentences 
analysed successfully received under 100 parses, 
although there is a long tail in the distribu- 
tion. Monitoring this distribution is helpful during 
grammar development to ensure that coverage is 
increasing but the ambiguity rate is not. A more 
succinct though less intuitive measure of ambigu- 
ity rate for a given corpus is Briscoe & Carroll's 
(1995) average parse base (APB), defined as the 
geometric mean over all sentences in the corpus 
of $\sqrt[n]{p}$, where n is the number of words in a sentence,
and p the number of parses for that sentence.
Thus, given a sentence n words long, the
APB raised to the nth power gives the number of 
analyses that the grammar can be expected to assign
to a sentence of that length in the corpus. Table 1
gives these measures for all of the sentences
in Susanne and in SEC.
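
The APB definition translates directly into a few lines of code. The following sketch computes the geometric mean in log space to avoid overflow on highly ambiguous sentences; the function names and the (words, parses) pair representation are assumptions made for the example, and it assumes every sentence supplied has at least one parse.

```python
from math import exp, log


def average_parse_base(sentences):
    """Geometric mean over sentences of p ** (1 / n).

    sentences is an iterable of (n_words, n_parses) pairs, each with
    n_parses >= 1 (sentences without a parse are assumed excluded).
    """
    logs = [log(p) / n for n, p in sentences]
    return exp(sum(logs) / len(logs))


def expected_parses(apb, n_words):
    """Number of analyses expected for a sentence of n_words words."""
    return apb ** n_words


# Toy data: each sentence has a parse base of exactly 2.
apb = average_parse_base([(2, 4), (3, 8), (4, 16)])

# Using the reported Susanne APB and mean sentence length.
susanne_estimate = expected_parses(1.313, 20.1)
```

With the reported Susanne figures (APB 1.313, mean sentence length 20.1), `expected_parses` gives a value of roughly 238, matching the order of magnitude quoted for a sentence of average length.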

As the grammar was developed solely with ref- 
erence to Susanne, coverage of SEC is quite robust. 
The two corpora differ considerably since the for- 
mer is drawn from American written text whilst 
the latter represents British transcribed spoken 
material. The corpora overall contain material 
drawn from widely disparate genres / registers, 
and are more complex than those used in DARPA 
ATIS tests, and more diverse than those used 
in MUCs and probably also the Penn Treebank. 
Black et al. (1993) report a coverage of around
95% on computer manuals, as opposed to our coverage
rate of 70-80% on much more heterogeneous
data and longer sentences. The APBs for Susanne
and SEC of 1.313 and 1.300 respectively indicate
that sentences of average length in each corpus
could be expected to be assigned of the order of
238 and 376 analyses (i.e. $1.313^{20.1}$ and $1.300^{22.6}$).

The parser throughput on these tests, for sen- 
tences successfully analysed, is around 25 words 
per CPU second on an HP PA-RISC 715/100. 
Sentences of up to 30 tokens (words plus sentence- 
internal punctuation) are parsed in an average of 
under 1 second each, whilst those around 60 tokens 
take on average around 7 seconds. Nevertheless, 
the relationship between sentence length and pro- 
cessing time is fitted well by a quadratic function, 
supporting the findings of Carroll (1994) that in 
practice NL grammars do not evince worst-case 
parsing complexity. 

Grammar Development & Refinement

The results we report above relate to the latest 
version of the tag sequence grammar. To date, we 
have spent about one person-year on grammar de- 
velopment, with the effort spread fairly evenly over 
a two-and-a-half-year period. The various phases 
in the development and refinement of the grammar 
can be observed in an analysis of the coverage and 
APB for Susanne and SEC over this period (see
Table 2). The phases, with dates, were:

6/92–11/93 Initial development of the grammar.

11/93–7/94 Substantial increase in coverage on
the development corpus (Susanne), corresponding
to a drive to increase the general coverage
of the grammar by analysing parse failures on
actual corpus material. From a lower initial figure,
coverage of SEC (unseen corpus) increased
by a larger factor.

7/94–12/94 Incremental improvements in coverage,
but at the cost of increasing the ambiguity
of the grammar.

12/94–10/95 Improving the accuracy of the system
by trying to ensure that the correct analysis
was in the set returned.

Since the coverage on SEC is increasing at the
same time as on Susanne, we can conclude that
the grammar has not been specifically tuned to
the particular sublanguages or genres represented
in the development corpus. Also, although the
almost-50% initial coverage on the heterogeneous
text of Susanne compares well with the state-of-the-art
in grammar-based approaches to NL analysis
(e.g. see Taylor et al., 1989; Alshawi et al.,
1992), it is clear that the subsequent grammar refinement
phases have led to major improvements
in coverage and reductions in spurious ambiguity.

                                 Susanne            SEC
    Parse fails                1476   21.0%      809   31.3%
    1-9 parses                 1436   20.5%      477   18.4%
    10-99 parses               1218   17.4%      378   14.6%
    100-999 parses              953   13.6%      276   10.7%
    1K-9.9K parses              694    9.9%      225    8.7%
    10K-99K parses              474    6.8%      154    6.0%
    100K+ parses                750   10.7%      264   10.2%
    Time-outs                    13    0.2%        4    0.2%
    Number of sentences        7014             2717
    Mean sentence length (MSL) 20.1             22.6
    MSL - fails                20.9             29.5
    MSL - time-outs            73.6             65.8
    Average Parse Base         1.313            1.300

Table 1: Grammar coverage on Susanne and SEC

                   Susanne              SEC
    date      coverage   APB^20.1     coverage
    11/93      47.8%       667         34.3%
    1/94       56.7%       160         45.7%
    7/94       75.3%       192         67.1%
    12/94      79.0%       217         68.9%
    10/95      79.0%       238         68.7%

Table 2: Grammar coverage and ambiguity during
development


We have experimented with increasing the 
richness of the lexical feature set by incorporating 
subcategorisation information for verbs into the
grammar and lexicon. We constructed randomly 
from Susanne a test corpus of 250 in-coverage sen- 
tences, and in this, for each word tagged as pos- 
sibly being an open-class verb (i.e. not a modal 
or auxiliary) we extracted from the ANLT lexi- 
con (Carroll & Grover, 1989) all verbal entries for 
that word. We then mapped these entries into 
our PoS grammar experimental subcategorisation 
scheme, in which we distinguished each possible 
pattern of complementation allowed by the gram- 
mar (but not control relationships, specification 
of prepositional heads of PP complements etc. as 
in the full ANLT representation scheme). We 
then attempted to parse the test sentences, using
the derived verbal entries instead of the original
generic entries which generalised over all the
subcategorisation possibilities. 31 sentences now
failed to receive a parse, a decrease in coverage of 
12%. This is due to the fact that the ANLT lexi- 
con, although large and comprehensive by current 
standards (Briscoe & Carroll, 1996), nevertheless 
contains many errors of omission. 

PARSE SELECTION 

A probabilistic LR parser was trained with the in- 
tegrated grammar by exploiting the Susanne tree- 
bank bracketing. An LR parser (Briscoe & Car- 
roll, 1993) was applied to unlabelled bracketed 
sentences from the Susanne treebank, and a new 
treebank of 1758 correct and complete analyses 
with respect to the integrated grammar was con- 
structed semi-automatically by manually resolving 
the remaining ambiguities. 250 sentences from the 
new treebank, selected randomly, were kept back 
for testing³. The remainder, together with a further
set of analyses from 2285 treebank sentences
that were not checked manually, were used to
train a probabilistic version of the LR parser, using
Good-Turing smoothing to estimate the probability
of unseen transitions in the LALR(1) table
(Briscoe & Carroll, 1993; Carroll, 1993). The
probabilistic parser can then return a ranking of 
all possible analyses for a sentence, or efficiently 
return just the n-most probable (Carroll, 1993). 
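
As a rough illustration of the smoothing step, the basic Good-Turing estimate reserves for unseen events a total probability mass of N1/N, where N1 is the number of event types observed exactly once and N is the total number of observations. The sketch below applies that estimate to transition counts. The function name, the flat pooling of all transitions, and the even sharing of the reserved mass among unseen transitions are assumptions made for the example; the exact variant used in the system is not specified in this paper.

```python
from collections import Counter


def unseen_transition_prob(transition_counts, num_unseen):
    """Good-Turing probability assigned to a single unseen transition.

    transition_counts maps each observed transition to its frequency;
    num_unseen is the number of table transitions never observed.
    The reserved mass N1 / N is split evenly among them (an assumption).
    """
    total = sum(transition_counts.values())
    singletons = sum(1 for c in transition_counts.values() if c == 1)
    if total == 0 or num_unseen == 0:
        return 0.0
    return (singletons / total) / num_unseen


# Toy counts: 10 observations, 2 transitions seen exactly once.
counts = Counter({"t1": 5, "t2": 1, "t3": 1, "t4": 3})
p_unseen = unseen_transition_prob(counts, num_unseen=4)
```

In the toy data the reserved mass is 2/10, so each of the four unseen transitions receives 0.05.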

The probabilistic parser was tested on the
250 sentences held out from the manually-disambiguated
treebank (of lengths 3-56 tokens,
mean 18.2). The parser was set up to return
only the highest-ranked analysis for each sentence.
Table 3 shows the results of this test, with respect
to the original Susanne bracketings, using
the Grammar Evaluation Interest Group scheme
(GEIG, see e.g. Harrison et al., 1991)⁴. This compares
unlabelled bracketings derived from corpus
treebanks with those derived from parses for the
same sentences by computing recall, the ratio of
matched brackets over all brackets in the treebank;
precision, the ratio of matched brackets over all
brackets found by the parser; mean crossings, the
number of times a bracketed sequence output by
the parser overlaps with one from the treebank
but neither is properly contained in the other, averaged
over all sentences; and zero crossings, the
percentage of sentences for which the analysis returned
has zero crossings.

3 The appendix contains a random sample of sentences
from the test corpus.

                                       Zero       Mean
                                       crossings  crossings  Recall  Precision
    Probabilistic parser analyses
      Top-ranked analysis              59.6%      1.03       74.0%   73.0%
      Random analysis                  40.4%      1.84       58.6%   60.0%
    Manually-disambiguated analyses
      'Ideal' analysis                 80.1%      0.41       85.4%   82.9%

Table 3: GEIG evaluation metrics for test set of 250 held-back sentences against Susanne bracketings
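
The four GEIG measures can be sketched in code as follows. This is not the GEIG evaluation software itself: the representation of a bracketing as a set of (start, end) word spans, the function names, the pooling of bracket counts over the corpus for recall and precision, and the decision to count each parser bracket at most once as crossing are all assumptions made for the example.

```python
def crosses(a, b):
    """True if spans a and b overlap but neither contains the other."""
    (s1, e1), (s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1


def geig_sentence(parser_spans, treebank_spans):
    """Return (matched, parser total, treebank total, crossings)."""
    parser_spans, treebank_spans = set(parser_spans), set(treebank_spans)
    matched = len(parser_spans & treebank_spans)
    crossings = sum(
        1 for p in parser_spans
        if any(crosses(p, t) for t in treebank_spans)
    )
    return matched, len(parser_spans), len(treebank_spans), crossings


def geig_corpus(pairs):
    """Aggregate the measures over (parser_spans, treebank_spans) pairs."""
    stats = [geig_sentence(p, t) for p, t in pairs]
    matched = sum(s[0] for s in stats)
    return {
        "recall": matched / sum(s[2] for s in stats),
        "precision": matched / sum(s[1] for s in stats),
        "mean_crossings": sum(s[3] for s in stats) / len(stats),
        "zero_crossings": sum(1 for s in stats if s[3] == 0) / len(stats),
    }


# Two toy sentences; spans are half-open word offsets (invented data).
pairs = [
    ({(0, 5), (0, 3), (3, 5)}, {(0, 5), (0, 2), (2, 5), (3, 5)}),
    ({(0, 2)}, {(0, 2)}),
]
scores = geig_corpus(pairs)
```

In the first toy sentence the parser span (0, 3) crosses the treebank span (2, 5), so that sentence contributes one crossing; the second sentence matches exactly.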

The table also gives an indication of the best 
and worst possible performance of the disambigua- 
tion component of the system, showing the results 
obtained when parse selection is replaced by a sim- 
ple random choice, and the results of evaluating 
the analyses in the manually-disambiguated tree- 
bank against the corresponding original Susanne 
bracketings. In this latter figure, the mean number 
of crossings (0.41) is greater than zero mainly be- 
cause of incompatibilities between the structural 
representations chosen by the grammarian and the 
corresponding ones in the treebank. Precision is 
less than 100% due to crossings, minor mismatches 
and inconsistencies (due to the manual nature of 
the markup process) in tree annotations, and the 
fact that Susanne often favours a "flat" treatment 
of VP constituents, whereas our grammar always 
makes an explicit choice between argument- and 
adjunct-hood. Thus, perhaps a more informative
test of the accuracy of our probabilistic system
would be evaluation against the manually-disambiguated
corpus of analyses assigned by the
grammar. In this, the mean crossing figure drops
to 0.71 and the recall and precision rise to 83-84%,
as shown in Table 4.

4 We would like to thank Phil Harrison for supplying
the evaluation software.

Black et al. (1993:7) use the crossing brackets
measure to define a notion of structural consis- 
tency, where the structural consistency rate for the 
grammar is defined as the proportion of sentences 
for which at least one analysis — from the many 
typically returned by the grammar — contains no 
crossing brackets, and report a rate of around 
95% for the IBM grammar tested on the com- 
puter manual corpus. However, a problem with 
the GEIG scheme and with structural consistency 
is that both are still weak measures (designed 
to avoid problems of parser/treebank represen- 
tational compatibility) which lead to unintuitive 
numbers whose significance still depends heavily 
on details of the relationship between the repre- 
sentations compared (e.g. between structure as- 
signed by a grammar and that in a treebank). One 
particular problem with the crossing bracket mea- 
sure is that a single attachment mistake embedded 
n levels deep (and perhaps completely innocuous, 
such as an "aside" delimited by dashes) can lead 
to n crossings being assigned, whereas incorrect 
identification of arguments and adjuncts can go 
unpunished in some cases. 

Schabes et al. (1993) and Magerman (1995)
report results using the GEIG evaluation scheme
which are numerically similar in terms of parse selection
to those reported here, but achieve 100%
coverage. However, their experiments are not
strictly comparable because they both utilise more
homogeneous and probably simpler corpora. (The
appendix gives an indication of the diversity of
the sentences in our corpus.) In addition, Schabes
et al. do not recover tree labelling, whilst
Magerman has developed a parser designed to produce
identical analyses to those used in the Penn
Treebank, removing the problem of spurious errors
due to grammatical incompatibility. Both
these approaches achieve better coverage by constructing
the grammar fully automatically, but as
an inevitable side-effect the range of text phenomena
that can be parsed becomes limited to those
present in the training material, and being able to
deal with new ones would entail further substantial
treebanking efforts.

                                       Zero       Mean
                                       crossings  crossings  Recall  Precision
    Probabilistic parser analyses
      Top-ranked analysis              67.2%      0.71       82.9%   83.9%

Table 4: GEIG evaluation metrics for test set of 250 held-back sentences against the manually-disambiguated
analyses

To date, no robust parser has been shown 
to be practical and useful for some NLP task. 
However, it seems likely that, say, rule-to-rule se- 
mantic interpretation will be easier with hand- 
constructed grammars with an explicit, determi- 
nate rule-set. A more meaningful parser compar- 
ison would require application of different parsers 
to an identical and extended test suite and utilisa- 
tion of a more stringent standard evaluation pro- 
cedure sensitive to node labellings. 

Training Data Size and Accuracy 

Statistical HMM-based part-of-speech taggers re- 
quire of the order of 100K words and upwards of 
training data (Weischedel et al., 1993:363); taggers
inducing non-probabilistic rules (e.g. Brill,
1994) require similar amounts (Gaizauskas, pc). 
Our probabilistic disambiguation system currently 
makes no use of lexical frequency information, 
training only on structural configurations. Nev- 
ertheless, the number of parameters in the prob- 
abilistic model is large: it is the total number of 
possible transitions in an LALR(1) table containing
over 150,000 actions. It is therefore interesting
to investigate whether the system requires more 
or less training data than a tagger. 

We therefore ran the same experiment as 
above, using GEIG to measure the accuracy of 
the system on the 250 held-back sentences, but 
varying the amount of training data with which 
the system was provided. We started at the full 
amount (3793 trees), and then successively halved
it by selecting the appropriate number of trees at
random. The results obtained are given in Figure 1.
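
The successive-halving procedure can be sketched as follows. Whether each smaller training set was drawn from the previous one or independently from the full set is not stated, so the nesting used here is an assumption, as are the function name, the fixed random seed and the use of integers as stand-ins for trees.

```python
import random


def halving_subsets(trees, rounds, seed=0):
    """Return the full training set followed by successively halved
    random subsets, each sampled from the one before it."""
    rng = random.Random(seed)
    subsets = [list(trees)]
    for _ in range(rounds):
        previous = subsets[-1]
        subsets.append(rng.sample(previous, len(previous) // 2))
    return subsets


# Integers stand in for the 3793 training trees.
subsets = halving_subsets(range(3793), rounds=6)
sizes = [len(s) for s in subsets]
```

Six halvings of 3793 trees give training sets of 1896, 948, 474, 237, 118 and 59 trees, the last being about one sixty-fourth of the full amount.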

The results show convincingly that the system 
is extremely robust when confronted with limited 
amounts of training data: when using a mere one 
sixty-fourth of the full amount (59 trees), accuracy
was degraded by only 10-20%. However, there
is a large decrease in accuracy with no training
data (i.e. random choice). Conversely, accuracy is 
still improving at 3800 trees, with no sign of over- 
training, although it appears to be approaching an 
upper asymptote. To determine what this might 
be, we ran the system on a set of 250 sentences ran- 
domly extracted from the training corpus. On this 
set, the system achieves a zero crossings rate of 
60.0%, mean crossings 0.88, and recall and preci- 
sion of 77.0% and 75.2% respectively, with respect 
to the original Susanne bracketings. Although this 
is a different set of sentences, it is likely that the 
upper asymptote for accuracy for the test corpus 
lies in this region. Given that accuracy is increas- 
ing only slowly and is relatively close to the asymp- 
tote it is therefore unlikely that it would be worth 
investing effort in increasing the size of the train- 
ing corpus at this stage in the development of the 
system. 

6. CONCLUSIONS 

In this paper we have outlined an approach to ro- 
bust domain-independent parsing, in which sub- 
categorisation constraints play no part, resulting 
in coverage that greatly improves upon more con- 
ventional grammar-based approaches to NL text 
analysis. We described an implemented system, 
and evaluated its performance along several dif- 
ferent dimensions. We assessed its coverage and 
that of previous versions on a development cor- 
pus and an unseen corpus, and demonstrated that 
the grammar refinement we have carried out has 
led to substantial improvements in coverage and 
reductions in spurious ambiguity. We also evalu- 
ated the accuracy of parse selection with respect 
to treebank analyses, and, by varying the amount 
of training material, we showed that it requires 
comparatively little data to achieve a good level 
of accuracy. 

We have made good progress in increasing 
grammar coverage, though we have now reached 
a point of diminishing returns. Further significant 
improvements in this area would require corpus- 
specific additions and tuning whose benefit would 
not necessarily carry over to other corpora. In the
application we are currently using the system for,
automatic extraction of subcategorisation frames,
and more generally argument structure, from large
amounts of text (Briscoe & Carroll, 1996), we do
not need full coverage; 70-80% appears to be sufficient.
However, further improvements in coverage
will require some automated approach to rule
induction driven by parse failure. Since our evaluations
indicate that our system achieves a good
level of accuracy with little treebank data, and
that 67-75% coverage was achieved for English
quite early in the grammar refinement effort, porting
the current system to other languages should
be possible with small-to-medium-sized treebanks
(around 20K words) and feasible manual effort
(of the order of 12 person-months for grammar-writing
and treebanking). This may yield a system
accurate enough for some types of application,
given that the system is not restricted to returning
the single highest ranked analysis but can return
the n-highest ranked for further application-specific
selection.

Figure 1: GEIG metrics for held-back sentences, training on varying amounts of data

Although we report promising results, parse 
selection that is sufficiently accurate for many 
practical applications will require a more lexi- 
calised system. Magerman's (1995) parser is an 
extension of the history-based parsing approach 
developed at IBM (Black et al., 1993) in which 
rules are conditioned on lexical and other (essentially
arbitrary) information available in the
parse history. In future work, we intend to explore
a more restricted and semantically-driven
version of this approach in which, firstly, probabili- 
ties are associated with different subcategorisation 
possibilities, and secondly, alternative predicate- 
argument structures derived from the grammar 
are ranked probabilistically. However, the mas- 
sively increased coverage obtained here by relaxing 
subcategorisation constraints underlines the need 
to acquire accurate and complete subcategorisa- 
tion frames in a corpus-driven fashion, before such 
constraints can be exploited robustly and effec- 
tively with free text. 
APPENDIX

Below is a random sample of the 250-sentence test 
set. The test set comprises the Brown genre cat- 
egories: "press reportage"; "belles lettres, biog- 
raphy, memoirs"; and "learned (mainly scientific 
and technical) writing".

"Yes, your honour", replied Bellows. 

This is another of the modifications of policy 
on Laos that the Kennedy administration has 
felt compelled to make. 

On Monday, the Hughes concern was formally 
declared bankrupt after its directors indicated 
they could not draw up a plan for reorganiza- 
tion. 

Ierulli will replace Desmond D. Connall who
has been called to active military service but 
is expected back on the job by March 31. 

Place kicking is largely a matter of timing, 
Moritz declared. 

Ritchie walked up to him at the magazine 
stand. 

Hector Lopez, subbing for Berra, smashed a 
3-run homer off Bill Henry during another 5- 
run explosion in the fourth. 

That's how he first won the Masters in 1958. 

Cooperman and Teller are accused of selling 
$4,700 worth of heroin to a convicted nar- 
cotics peddler, Otis Sears, 45, of 6934 Indi- 
ana av. 

However, the system is designed, ingeniously 
and hopefully, so that no one man could ini- 
tiate a thermonuclear war. 

He bent down, a black cranelike figure, and 
put his mouth to the ground. 



Those who actually get there find that it isn't 
spooky at all but as brilliant as a tile in sun- 
light. 

Others look to more objective devices of or- 
der. 

What additional roles has the scientific un- 
derstanding of the 19th and 20th centuries 
played? 

If we look at recent art we find it preoccupied 
with form. 

Hence the beatniks sustain themselves on 
marijuana, jazz, free swinging poetry, ex- 
hausting themselves in orgies of sex; some of 
them are driven over the borderline of sanity 
and lose contact with reality. 

Heidenstam could never be satisfied by sur- 
face. 

Individual human strength is needed to pit 
against an inhuman condition. 

The pressure gradient producing the jet is due 
to the nature of the magnetic field in the arc 
(rapid decrease of current density from cath- 
ode to the anode). 

At 100 Amp the 360 cycle ripple was less than 
0.5 V (peak to peak) with a resistive load. 



